Sample adaptive offset (SAO) filtering in video coding

ABSTRACT

A method for sample adaptive offset (SAO) filtering of largest coding units (LCUs) of a video frame in an SAO component is provided that includes receiving, by the SAO component, an indication that deblocked pixel blocks of an LCU are available, and applying SAO filtering, by the SAO component, to each pixel block of pixel blocks of an SAO processing area corresponding to the LCU responsive to the indication, wherein pixels of each pixel block of the SAO processing area are filtered in parallel.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Continuation of application Ser. No. 14/279,318 filed May 16, 2014, which claims benefit of U.S. Provisional Patent Application Ser. No. 61/825,286, filed May 20, 2013, which is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

Field of the Invention

Embodiments of the present invention generally relate to sample adaptive offset (SAO) filtering in video coding.

Description of the Related Art

The Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T WP3/16 and ISO/IEC JTC 1/SC 29/WG 11 has developed the next-generation video coding standard referred to as High Efficiency Video Coding (HEVC). Similar to previous video coding standards such as H.264/AVC, HEVC is based on a hybrid coding scheme using block-based prediction and transform coding. First, the input signal is split into rectangular blocks that are predicted from the previously decoded data by either motion compensated (inter) prediction or intra prediction. The resulting prediction error is coded by applying block transforms based on an integer approximation of the discrete cosine transform, which is followed by quantization and coding of the transform coefficients.

In a coding scheme that uses block-based prediction, transform coding, and quantization, some characteristics of the compressed video data may differ from the original video data. For example, discontinuities referred to as blocking artifacts can occur in the reconstructed signal at block boundaries. Further, the intensity of the compressed video data may be shifted. Such intensity shift may also cause visual impairments or artifacts. To help reduce such artifacts in decompressed video, the HEVC standard defines two in-loop filters: a deblocking filter to reduce blocking artifacts and a sample adaptive offset filter (SAO) to reduce distortion caused by intensity shift. These filters may be applied sequentially, and, depending on the configuration, the SAO filter may be applied to the output of the deblocking filter. This in-loop filtering is one of the most computationally intensive parts of the decoding process and may be approximately 15-20% of the overall decoding complexity.

SUMMARY

Embodiments of the present invention relate to methods and apparatus for sample adaptive offset (SAO) filtering in video decoding. In one aspect, a method for sample adaptive offset (SAO) filtering of largest coding units (LCUs) of a video frame in an SAO component is provided that includes receiving, by the SAO component, an indication that deblocked pixel blocks of an LCU are available, and applying SAO filtering, by the SAO component, to each pixel block of pixel blocks of an SAO processing area corresponding to the LCU responsive to the indication, wherein pixels of each pixel block of the SAO processing area are filtered in parallel.

In one aspect, an apparatus for sample adaptive offset (SAO) filtering is provided that includes a memory, a controller coupled to the memory and configured to sequence loading of pixel blocks of an SAO processing area into the memory, filtering of the pixel blocks by a filter engine, and storing of the filtered pixel blocks, wherein the SAO processing area corresponds to a largest coding unit (LCU) of a video frame, and wherein the loading, filtering, and storing is performed responsive to an indication that deblocked pixel blocks of the LCU are available, and the filter engine coupled to the controller and the memory, wherein the filter engine is configured to apply SAO filtering to a pixel block of the SAO processing area stored in the memory, wherein all pixels in the pixel block are filtered in parallel.

BRIEF DESCRIPTION OF THE DRAWINGS

Particular embodiments will now be described, by way of example only, and with reference to the accompanying drawings:

FIG. 1 is an example illustrating band offset (BO) classification in sample adaptive offset (SAO) filtering;

FIG. 2 is an example illustrating edge offset (EO) classification patterns in SAO filtering;

FIG. 3 is an example illustrating edge types by EO category;

FIG. 4 is a block diagram of an SAO filter architecture;

FIG. 5 is an example illustrating the SAO processing area of a largest coding unit (LCU);

FIG. 6 is a flow diagram of one method of SAO filtering;

FIG. 7 is a flow diagram of another method of SAO filtering;

FIG. 8 illustrates an example frame divided into 32×32 LCUs;

FIG. 9 illustrates three work buffers stored in a work memory, an LCU divided into pixel blocks, and an SAO processing area associated with the LCU, also divided into pixel blocks;

FIG. 10 illustrates a detailed pixel block filtering order for the SAO processing area of FIG. 9;

FIG. 11A illustrates the content of a filter block for L11;

FIG. 11B illustrates the content of a filter block for A11;

FIG. 11C illustrates the content of a filter block for A12;

FIG. 11D illustrates the content of a filter block for A21;

FIG. 11E illustrates the content of a filter block for L12;

FIG. 12 illustrates an SAO processing area divided into 16×16 sub-processing areas;

FIG. 13 is a conceptual illustration of a three stage pipelined filtering process; and

FIG. 14 is a block diagram of the filter engine of the SAO filter architecture.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.

As used herein, the term “picture” may refer to a frame or a field of a frame. A frame is a complete image captured during a known time interval. For convenience of description, embodiments are described herein in reference to HEVC. One of ordinary skill in the art will understand that embodiments of the invention are not limited to HEVC.

In HEVC, a largest coding unit (LCU) is the base unit used for block-based coding. Note that an LCU may also be called a coding tree unit (CTU) in some documents. A picture is divided into non-overlapping LCUs. That is, an LCU plays a similar role in coding as the macroblock of H.264/AVC, but it may be larger, e.g., 32×32, 64×64, etc. An LCU may be partitioned into coding units (CU). A CU is a block of pixels within an LCU and the CUs within an LCU may be of different sizes. The partitioning is a recursive quadtree partitioning. The quadtree is split according to various criteria until a leaf is reached, which is referred to as the coding node or coding unit. The maximum hierarchical depth of the quadtree is determined by the size of the smallest CU (SCU) permitted. The coding node is the root node of two trees, a prediction tree and a transform tree. A prediction tree specifies the position and size of prediction units (PU) for a coding unit. A transform tree specifies the position and size of transform units (TU) for a coding unit. A transform unit may not be larger than a coding unit and the size of a transform unit may be, for example, 4×4, 8×8, 16×16, and 32×32. The sizes of the transform units and prediction units for a CU are determined by the video encoder during prediction based on minimization of rate/distortion costs.

The current released version of HEVC is described in the following document, which is incorporated by reference herein: “ITU-T Recommendation H.265: High Efficiency Video Coding”, Telecommunication Standardization Sector of International Telecommunication Union (ITU-T), April, 2013 (“HEVC Standard”).

As previously mentioned, a sample adaptive offset (SAO) in-loop filter is one of the in-loop filters included in the HEVC standard. These in-loop filters are applied in the encoder and the decoder. A high level description of SAO is provided herein. A more detailed description may be found, for example, in the HEVC Standard and C. Fu, et al., “Sample Adaptive Offset in the HEVC Standard,” IEEE Transactions on Circuits and Systems for Video Technology, Vol 22, No. 12, pp. 1755-1764, December 2012. SAO may be applied to reconstructed pixels after application of a deblocking filter. In general, SAO involves adding an offset to compensate for intensity shift directly to a reconstructed pixel. The value of the offset depends on the local characteristics surrounding the pixel, i.e., edge direction/shape and/or pixel intensity level. There are two kinds of offsets that may be applied: band offsets (BO) and edge offsets (EO). The band offset classifies pixels by intensity interval of the reconstructed pixel, while edge offset classifies pixels based on edge direction and structure.

To determine band offsets, pixels are classified by intensity level of the corresponding reconstructed pixels. As illustrated in FIG. 1, to determine band offsets, reconstructed pixels are classified into multiple bands where each band contains pixels in the same intensity interval. That is, the intensity range is equally divided into 32 bands from zero to the maximum intensity value. For example, for 8-bit pixels with values ranging from 0 to 255, the width of each band is 8, and pixel values from 8k to 8k+7 are in a band k, where 0≤k≤31. The offset for a band may be computed as an average of the differences between the original pixel values and the reconstructed pixel values of the pixels classified into the band.
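The band classification and the per-band averaging described above can be sketched as follows. This is a minimal illustration only: the function names band_index and band_offsets are hypothetical, and the rounding, clipping, and band selection that a real encoder would apply to the offsets are omitted.

```python
def band_index(pixel, bit_depth=8):
    """Return the band k (0..31) containing a reconstructed pixel value."""
    if not 0 <= pixel < (1 << bit_depth):
        raise ValueError("pixel value out of range for the given bit depth")
    # 32 equal-width bands: keep the five most significant bits.
    # For 8-bit pixels this is pixel >> 3, i.e. values 8k..8k+7 map to band k.
    return pixel >> (bit_depth - 5)


def band_offsets(original, reconstructed, bit_depth=8):
    """Average (original - reconstructed) difference per band.

    Bands that contain no pixels get an offset of 0.0 (an assumption made
    for this sketch).
    """
    sums = [0] * 32
    counts = [0] * 32
    for orig_px, rec_px in zip(original, reconstructed):
        k = band_index(rec_px, bit_depth)
        sums[k] += orig_px - rec_px
        counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

For example, reconstructed values 8 and 9 both fall in band 1, so their differences from the original values are averaged into a single offset for that band.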

To determine edge offsets, reconstructed pixels are classified based on a one dimensional (1-D) delta calculation. That is, the pixels can be filtered in one of four edge directions (0, 90, 135, and 45) as shown in FIG. 2. For each edge direction, a pixel is classified into one of five categories based on the intensity of the pixel relative to neighboring pixels in the edge direction. Categories 1-4 each represent specific edge shapes as shown in FIG. 3 while category 0 is indicative that none of these edge shapes applies. Offsets for each of categories 1-4 are also computed after the pixels are classified.

More specifically, for each edge direction, a category number c for a pixel is computed as c = sign(p0 - p1) + sign(p0 - p2) where p0 is the pixel and p1 and p2 are neighboring pixels, i.e., the “shaded” pixels of FIG. 2. The edge conditions that result in classifying a pixel into a category are shown in Table 1 and are also illustrated in FIG. 3. After the pixels are classified, offsets are generated for each of categories 1-4. The offset for a category may be computed as an average of the differences between the original pixel values and the reconstructed pixel values of the pixels in the region classified into the category.

TABLE 1

Category   Condition
1          p0 < p1 and p0 < p2
2          (p0 < p1 and p0 = p2) or (p0 < p2 and p0 = p1)
3          (p0 > p1 and p0 = p2) or (p0 > p2 and p0 = p1)
4          p0 > p1 and p0 > p2
0          none of above
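The conditions of Table 1 can be evaluated from the sum of the two sign terms, sign(p0 - p1) + sign(p0 - p2), which takes a value from -2 to 2. The following sketch maps that sum onto the category numbers of Table 1. The lookup table and the name eo_category are illustrative assumptions, not part of the described architecture.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


# Sum of the two sign terms -> category of Table 1.
#   -2: p0 below both neighbors               -> category 1
#   -1: p0 below one neighbor, equal to other -> category 2
#    0: none of the edge shapes               -> category 0
#   +1: p0 above one neighbor, equal to other -> category 3
#   +2: p0 above both neighbors               -> category 4
CATEGORY_FROM_SUM = {-2: 1, -1: 2, 0: 0, 1: 3, 2: 4}


def eo_category(p0, p1, p2):
    """Classify pixel p0 against its two neighbors p1 and p2."""
    return CATEGORY_FROM_SUM[sign(p0 - p1) + sign(p0 - p2)]
```

Note that a sum of zero covers both a flat region (p0 = p1 = p2) and a monotonic ramp (p1 < p0 < p2), neither of which matches an edge shape in Table 1.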

In HEVC, the determination of the SAO filter type and offsets for color components is performed at the LCU level. The encoder decides which of the SAO filter types is to be used for each color component, i.e., Y, Cb, and Cr, of an LCU. The encoder may use any suitable criteria for selecting the SAO filter types for the color components. For example, the encoder may decide the best SAO filter type and associated offsets for each color component based on a rate distortion technique that estimates the coding cost resulting from the use of each SAO filter type. More specifically, for each color component, the encoder may estimate the coding costs of SAO parameters, e.g., the SAO filter type and SAO offsets, resulting from using each of the predefined SAO filter types for the color component. The encoder may then select the option with the best coding cost for the color component. LCUs may also be “merged” for purposes of signaling SAO parameters in the compressed bit stream. In addition to directly determining the best SAO filter type and offsets for the color components of an LCU, the encoder may also consider the coding costs resulting from using the SAO parameters of corresponding color components in left and upper neighboring LCUs (if these neighboring LCUs are available). If the SAO parameters of one of the neighboring LCUs provide the best coding cost, one or more merge flags (one per color component as appropriate) are signaled in the compressed bit stream rather than directly signaling SAO parameters.

Embodiments of the invention provide for high throughput SAO filtering in video coding. More specifically, some embodiments may support 4K@60 fps (frames per second) for the next generation Ultra HDTV at a 100 MHz clock. In some embodiments, 64×64 blocks of pixels (the size of the largest LCU in HEVC) may be filtered in less than 800 cycles with performance directly scaling down based on LCU size. Some embodiments provide LCU level SAO filtering with a three-stage internal pipeline. Some embodiments use a novel filtering order as well as a novel scanning order and 4×4 pixel block based processing to improve filtering performance.
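The relationship between the frame rate, the clock rate, and the per-LCU cycle count can be checked with simple arithmetic. The sketch below assumes a 3840×2160 frame for “4K” and 4:2:0 chroma subsampling, neither of which is stated above, and the function names are hypothetical.

```python
import math


def cycles_available_per_lcu(width, height, fps, clock_hz, lcu_size):
    """Clock cycles available to process one LCU in real time."""
    lcus_per_frame = math.ceil(width / lcu_size) * math.ceil(height / lcu_size)
    return clock_hz / (lcus_per_frame * fps)


def pixel_blocks_per_lcu(lcu_size, block=4):
    """Number of 4x4 pixel blocks (luma + Cb + Cr) in one LCU, assuming 4:2:0."""
    luma = (lcu_size // block) ** 2
    chroma_side = (lcu_size // 2) // block  # each chroma plane is half size
    return luma + 2 * chroma_side ** 2


budget = cycles_available_per_lcu(3840, 2160, 60, 100_000_000, 64)
blocks = pixel_blocks_per_lcu(64)
print(f"cycle budget per 64x64 LCU: {budget:.1f}")
print(f"4x4 pixel blocks per 64x64 LCU: {blocks}")
```

Under these assumptions the budget is roughly 817 cycles per 64×64 LCU, and an LCU contains 384 pixel blocks, so filtering one pixel block per cycle in a pipelined fashion is consistent with a figure of less than 800 cycles.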

FIG. 4 is a block diagram of an SAO filter architecture 400 suitable for use in a video encoder or a video decoder. This architecture assumes a multi-ported pool of on-chip memory shared with other components of the video encoder or video decoder that supports concurrent accesses by the various components. The unit or granularity of pixel data which is shared between various components is a block of pixels. A pixel block is a non-overlapping small rectangular region of a frame that may be 4 pixels wide and 4 pixels high (4×4) for luma and 8 pixels wide and 2 pixels high (8×2) for chroma. However, the pixel blocks filtered by the SAO filter architecture are 4×4, regardless of color component. The architecture also assumes a shared direct memory access (DMA) component in the video encoder or decoder which manages data transfers between the shared memory and external memory.

The architecture 400 implements SAO filtering at the LCU level rather than at the frame level as specified in the HEVC standard while maintaining compliance with the expected output of frame level SAO filtering. The architecture 400 also assumes that the video encoder or video decoder performs deblocking at the LCU level. Because deblocking is also performed at the LCU level, deblocked pixel blocks from the neighboring right and bottom LCUs needed for the EO mode in SAO filtering of the right column and bottom row of a typical LCU are not available. Thus, the filtering of the right and bottom pixel blocks of an LCU is delayed until the needed deblocked pixel blocks are available. The architecture 400 is designed to handle the delay in availability of these pixel blocks.

More specifically, the area filtered in each LCU-based SAO cycle is shifted in the frame, i.e., the SAO processing area associated with an LCU is shifted up by one row of pixel blocks and left by one column of pixel blocks. This shifting is illustrated by the “shaded” area in the example of FIG. 5. Note that four sets of SAO parameters may be needed for SAO filtering of a processing area. As is explained in more detail herein, the architecture 400 implements a buffering scheme to handle the delay in filtering of the right and bottom pixel blocks of an LCU.
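The shifted processing area and the four LCUs whose SAO parameters it may need can be expressed with a few lines of coordinate arithmetic. The sketch below ignores frame boundaries, uses hypothetical function names, and assumes LCUs are numbered in raster order with a caller-supplied number of LCUs per row.

```python
def sao_processing_area(lcu_col, lcu_row, lcu_size, block=4):
    """Pixel rectangle (x0, y0, x1, y1) of the SAO processing area of an LCU.

    The area is the LCU footprint shifted up by one row and left by one
    column of pixel blocks; x1 and y1 are exclusive. Frame boundary
    clipping is not modeled.
    """
    x0 = lcu_col * lcu_size - block
    y0 = lcu_row * lcu_size - block
    return (x0, y0, x0 + lcu_size, y0 + lcu_size)


def contributing_lcus(lcu_col, lcu_row, lcus_per_row):
    """Raster indices of the up-to-four LCUs overlapped by the processing area.

    Order: top-left, top, left, current.
    """
    coords = [(lcu_col - 1, lcu_row - 1), (lcu_col, lcu_row - 1),
              (lcu_col - 1, lcu_row), (lcu_col, lcu_row)]
    return [r * lcus_per_row + c for c, r in coords if c >= 0 and r >= 0]
```

For an LCU in the second row and second column of a frame that is four LCUs wide, contributing_lcus(1, 1, 4) returns [0, 1, 4, 5], which is why four sets of SAO parameters may be needed for one processing area.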

Referring again to FIG. 4, the various components of the architecture 400 are now briefly described. Operation of the various components during the SAO filtering process is described in more detail in reference to the methods of FIGS. 6 and 7. The controller 406 manages the operation of various components of the SAO filter architecture 400. More specifically, the controller 406 sequences all filtering operations, e.g., loading of deblocked pixels, filtering, and formatting. The SAO parameter buffer 422 stores SAO parameters for the LCUs to be filtered. The SAO parameter buffer 422 operates in a first-in-first-out (FIFO) fashion. In a video decoder, as SAO parameters for LCUs are decoded from an encoded bit stream by the decoder, the decoder stores the parameters in the SAO parameter buffer 422 via the SAO parameter control 420. In a video encoder, the SAO parameters for an LCU are estimated by the encoder and stored in the SAO parameter buffer 422 via the SAO parameter control 420.

The configuration module 402 receives various frame level parameters, e.g., height and width of the current frame, height and width of an LCU in the frame, etc., and stores these parameters in the configuration registers 404. The SAO filter engine 408 performs the actual filtering operation on the pixels of each pixel block. The input to the filter engine is a 3×3 block of pixel blocks formed by the pixel block to be filtered and the eight neighboring pixel blocks needed for EO mode SAO filtering of the pixel block. This 3×3 block of pixel blocks is referred to as a filter block herein. The SAO filter engine 408 filters the 16 pixels of a pixel block in parallel.
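To illustrate why the eight neighboring pixel blocks are part of the filter block, the following sketch applies edge offsets to the center 4×4 pixel block of a 12×12 filter block. It is a sequential software model, not the parallel hardware of the filter engine. The neighbor step (dx, dy) is supplied by the caller rather than derived from an edge direction, and the function name, the offsets list layout (indexed by category, with entry 0 equal to zero), and the clipping to an 8-bit range are assumptions made for this example.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


# Sum of the two sign terms -> edge offset category (0 means no edge shape).
CATEGORY_FROM_SUM = {-2: 1, -1: 2, 0: 0, 1: 3, 2: 4}


def filter_center_block(filter_block, offsets, dx, dy, max_value=255):
    """Edge-offset filter the center 4x4 block of a 12x12 filter block.

    filter_block: 12 rows of 12 deblocked pixel values (3x3 pixel blocks).
    offsets:      five offsets indexed by category; offsets[0] should be 0.
    (dx, dy):     column/row step from a pixel to its first neighbor; the
                  second neighbor lies at the opposite step.
    """
    out = []
    for row in range(4, 8):
        out_row = []
        for col in range(4, 8):
            p0 = filter_block[row][col]
            # Neighbors are always read from the unfiltered input, so pixels
            # on the edge of the center block reach into neighboring blocks.
            p1 = filter_block[row + dy][col + dx]
            p2 = filter_block[row - dy][col - dx]
            category = CATEGORY_FROM_SUM[sign(p0 - p1) + sign(p0 - p2)]
            out_row.append(min(max_value, max(0, p0 + offsets[category])))
        out.append(out_row)
    return out
```

Because every output pixel depends only on unfiltered input pixels, the 16 results are independent of one another, which is what permits them to be computed in parallel.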

The DMA (direct memory access) interface 424 may be used by the controller 406 to trigger the DMA to read and write data between the shared memory pool and the off-chip memory. The memory interface 426 may be used to read and write data between various components of the architecture 400 and the shared memory pool. The formatter 414 converts filtered luma and chroma pixel blocks to the format expected by other parts of the encoder or decoder prior to storing the filtered pixel data in the shared memory pool. For example, the formatter 414 may perform pixel block to raster conversion and interleaving of filtered Cb and Cr blocks. The DBLK control 416 manages the DBLK memory 418 storing the deblocked pixels of the LCU being filtered. The DBLK control 416 receives deblocked pixel blocks and stores the pixel blocks in the DBLK memory 418 and provides deblocked pixel blocks to the work pixel buffer 410 as directed by the controller 406.

The working memory 412 stores two left work buffers of deblocked pixel blocks and a top work buffer of deblocked pixel blocks needed for filtering the SAO processing area being processed as well as any SAO parameters needed for filtering certain pixel blocks in these buffers. The left work buffers are referred to as Left Work Buffer 0 and Left Work Buffer 1 herein. The management and use of the three buffers is described in more detail herein in reference to the method of FIG. 7. The work pixel buffer 410 is used to build the filter blocks for input to the SAO filter engine 408. The work pixel buffer 410 is sized to support the pipelined load/filter/store filtering operation. Thus, the work pixel buffer 410 includes sufficient memory to store the nine pixel blocks of the filter block being processed by the SAO filter engine 408 as well as additional pixel blocks for loading the next pixel blocks needed to form the subsequent filter block. The work pixel buffer 410 further includes sufficient memory to store the filtered pixel block output by the filtering engine and the previously filtered pixel block to be transferred from the work pixel buffer 410 to the formatter 414. The work pixel buffer 410 includes sufficient memory to store the SAO parameters for the pixel blocks of the four LCUs included in the SAO processing area being filtered and sufficient memory to store certain pixel blocks for updating the work buffers.

FIGS. 6 and 7 are flow diagrams of methods for SAO filtering that may be performed by the architecture of FIG. 4. FIG. 6 is a method of filtering a frame and FIG. 7 is a method of filtering a pixel color component of an LCU. Although method steps may be presented and described in a sequential fashion, one or more of the steps shown in the figures and described herein may be performed concurrently, may be combined, and/or may be performed in a different order than the order shown in the figures and/or described herein. Accordingly, embodiments should not be considered limited to the specific ordering of steps shown in the figures and/or described herein.

As shown in FIG. 6, when the SAO filtering of a frame of decoded video is initiated, the relevant frame parameters are received 600 in the registers 404 of the architecture 400. These parameters are read from the registers 404 by the controller 406 and used to perform any initialization that may be needed. For example, the controller 406 may use the height and width of the frame and the LCU size to determine the number of LCUs in a frame, the number of LCUs in a row, etc. In another example, a frame parameter may indicate that SAO filtering is disabled at slice and/or tile boundaries. The controller 406 may then use this to disable SAO filtering at these boundaries. Disabling of SAO filtering for boundary conditions is described below in reference to FIG. 14. Steps 602-612 illustrate the operation flow for filtering each LCU of the frame and are repeated until all LCUs in the frame are processed 614.

When a deblocked LCU is ready 602 (and SAO filtering of the previous SAO processing area is finished), the controller 406 initiates the SAO filtering of an SAO processing area associated with the current deblocked LCU. When a deblocked LCU is available in DBLK memory 418, the controller 406 receives a signal from the DBLK control 416. The controller 406 then causes the DBLK control 416 to begin loading the deblocked pixel blocks of the LCU into the work pixel buffer 410. Loading of deblocked pixel blocks is described in more detail in reference to the method of FIG. 7. A deblocked LCU is ready when all portions of the LCU that can be deblocked have been deblocked. Due to the definition of deblocking in HEVC, the bottom three lines of pixels of an LCU will not be deblocked when made available for SAO filtering.

The color components of the SAO processing area are then filtered in turn according to the method of FIG. 7, i.e., the luminance (luma) component is filtered 604, then the Cb component is filtered 606, and finally the Cr component is filtered 608. The filtered color component pixel blocks are formatted 610 in the formatter 414, and the formatted pixel data is stored 612 in the shared memory pool via the memory interface 426. In general, the formatter 414 bypasses the filtered luma pixel blocks, i.e., the filtered luma pixel blocks are stored directly into the shared memory pool, and interleaves the filtered Cb and Cr pixel blocks prior to storage in the shared memory pool. Because all Cb pixel blocks are filtered before the Cr pixel blocks, the formatter 414 stores the Cb pixel blocks in an internal work memory and initiates the interleaving process as the Cr pixel blocks are filtered. The pixel blocks are stored in the shared memory pool in block format. In some embodiments, the formatter 414 may also convert the filtered pixel data to raster format for storage in the shared memory pool.

FIG. 7 is a flow diagram of a method of SAO filtering of a color component of an SAO processing area associated with an LCU. As was previously mentioned, the SAO processing area is the actual portion of the frame that will be SAO filtered when deblocked data for an LCU is available. The method is explained ignoring any boundary processing issues and in reference to the example of FIGS. 8-11E. FIG. 8 shows an example frame divided into 32×32 LCUs. For purposes of explaining the method, the assumption is made that the SAO processing area of LCU5 is being filtered. Thus, the SAO processing areas of LCU0-LCU4 have been filtered. Note that the SAO processing area of LCU5 includes the bottom right pixel block of LCU0, all the pixel blocks of the bottom row of LCU1 except the one at the bottom right, and all of the pixel blocks of the rightmost column of LCU4 except the one at the bottom right.

The example of FIG. 9 shows the three work buffers stored in work memory 412, a 32×32 LCU divided into 4×4 pixel blocks, and the SAO processing area associated with the LCU, also divided into 4×4 pixel blocks. For purposes of the initial explanation of the method, this example is assumed to correspond to LCU5 of FIG. 8. Note that the SAO processing area includes deblocked pixel blocks in Left Work Buffer 1. The pixel block L11 is the bottom right pixel block of LCU0 and the pixel blocks L12-L19 are the deblocked pixel blocks of the rightmost column of LCU4. Further, Left Work Buffer 1 includes pixel block L10 which is the pixel block of LCU0 immediately above the bottom right pixel block of LCU0. These pixel blocks were stored in Left Work Buffer 1 when the SAO processing area of the previous LCU, e.g., LCU4, was filtered.

In Left Work Buffer 0, the deblocked pixel block L01 is the pixel block in LCU0 immediately to the left of the bottom right pixel block of LCU0 and the deblocked pixel block L00 is the pixel block in LCU0 immediately to the left and above the bottom right pixel block of LCU0. Further, pixel blocks L02-L09 are the pixel blocks of the column of LCU4 immediately to the left of the rightmost column of LCU4. Pixel blocks L02-L08 are completely deblocked and the top row of pixels of L09 is deblocked. The pixel blocks were stored in Left Work Buffer 0 when the SAO processing area of the previous LCU, e.g., LCU4, was filtered.

The pixel blocks in the Top Work Buffer are the deblocked pixel blocks of the second to last row of LCU1. The pixel blocks in the Top Work Buffer were saved in the shared memory pool when the SAO processing area associated with LCU1 was filtered and are retrieved from the shared memory pool when needed for filtering of the SAO processing area associated with LCU5. Note that the pixel blocks needed to populate the top work buffers for a subsequent row of LCUs are saved in the shared memory pool rather than the work memory 412 as the SAO processing areas of the previous row of LCUs are filtered in order to reduce the size of the work memory 412.

Referring again to FIG. 7, when filtering of an SAO processing area is initiated, the SAO parameters needed to filter the SAO processing area are retrieved 700 by the controller 406 and stored in the work pixel buffer 410. In addition, the pixel blocks needed to filter the top row of pixel blocks in the SAO processing area are retrieved by the controller 406 from the shared memory pool and stored in the top work buffer in the work memory 412. More specifically, the SAO parameters for the LCU, e.g., LCU5, are retrieved 700 from the SAO parameter buffer 422 by the controller 406 and stored in the work pixel buffer 410. Further, the SAO parameters needed to filter the top row of pixel blocks of the SAO processing area, e.g., L11, A11, A12, A21, A22, B11, B12, and B21 of FIG. 9, are retrieved by the controller 406 and stored in the work pixel buffer. The top left pixel block of the SAO processing area, e.g., L11 of FIG. 9, is the bottom right pixel block of the top left neighboring LCU, e.g., LCU0 of FIG. 8, so the SAO parameters of that LCU are needed for filtering this pixel block. The remaining pixel blocks of the top row of the SAO processing area, e.g., A11, A12, A21, A22, B11, B12, and B21 of FIG. 9, are the bottom row of the top neighboring LCU, e.g., LCU1 of FIG. 8, less the rightmost block, so the SAO parameters of that LCU are needed for filtering these pixel blocks.

The SAO parameters for the top row of the SAO processing area are stored in the shared memory pool when the SAO processing areas of the previous row of LCUs are filtered and are retrieved by the controller 406 as needed. Note that the pixel blocks in the left column of the SAO processing area, e.g., L12-L18 of FIG. 9, except for the top pixel block are from the previous LCU, e.g., LCU4, so the SAO parameters for this LCU are needed for filtering these pixel blocks. These SAO parameters are already in the work pixel buffer and need not be retrieved by the controller 406.

Referring again to FIG. 7, the pixel blocks of the first filter block to be processed are loaded 702 into the work pixel buffer 410. The controller 406 causes the needed pixel blocks to be loaded from the left and top work buffers in the work memory 412 and/or DBLK memory 418 as needed. For example, the first pixel block of the SAO processing area associated with LCU5 of FIG. 8 to be filtered will be L11 of the Left Work Buffer 1 as shown in FIG. 9. To form the filter block for the pixel block L11, the controller 406 causes L10, L11, and L12 to be copied from the Left Work Buffer 1 to the work pixel buffer 410, L00, L01, and L02 to be copied from the Left Work Buffer 0 to the work pixel buffer 410, T00 to be copied from the top work buffer to the work pixel buffer 410, and A11 and A13 to be copied from DBLK memory 418 to the work pixel buffer 410. FIG. 11A shows the content of the filter block for L11.

Once the initial filter block is ready 704, the pipelined filtering process begins. In this pipelined process, the following operations are performed in parallel: the next filter block is loaded 706 into the work pixel buffer 410, the current filter block is processed by the filter engine 408 to filter 708 the current pixel block, and the previously filtered pixel block is stored 710. FIG. 13 is a conceptual illustration of this three stage pipelined filtering process. The filtering process continues 712 until all pixel blocks in the SAO processing area have been filtered.
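The overlap of the load, filter, and store stages can be visualized with a simple schedule generator. This is a conceptual model only: it treats each stage as taking one step and uses a hypothetical function name; actual stage latencies are not specified here.

```python
def pipeline_schedule(pixel_blocks):
    """Return a list of (load, filter, store) tuples, one per pipeline step.

    At step t the filter block for pixel block t is loaded, pixel block
    t-1 is filtered, and the filtered pixel block t-2 is stored. None
    marks an idle stage during pipeline fill and drain.
    """
    n = len(pixel_blocks)

    def at(i):
        return pixel_blocks[i] if 0 <= i < n else None

    return [(at(t), at(t - 1), at(t - 2)) for t in range(n + 2)]


for step, stages in enumerate(pipeline_schedule(["L11", "A11", "A12", "A21"])):
    print(step, stages)
```

With N pixel blocks the model completes in N + 2 steps rather than 3N, which is the benefit of running the three stages concurrently.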

The pixel blocks in the SAO processing area are filtered in a novel scan order. As illustrated in the example of FIG. 12, the SAO processing area is divided into 16×16 sub-processing areas. Within a 16×16 sub-processing area, the pixel blocks are filtered in raster scan order. The 16×16 sub-processing areas are processed in Z-scan order. The example of FIG. 10 illustrates the detailed pixel block filtering order for the example 32×32 SAO processing area of FIG. 9.
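The scan order described above can be generated programmatically. The sketch below assumes Z-scan means Morton order with the horizontal coordinate varying fastest, and it returns pixel block coordinates (column, row) relative to the top left of the SAO processing area. The function names are hypothetical.

```python
def morton_key(x, y, bits=8):
    """Interleave the bits of x and y (x in the even bit positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def filtering_order(area_size, sub_size=16, block=4):
    """List pixel block (col, row) coordinates in filtering order.

    Sub-processing areas are visited in Z-scan order; pixel blocks inside
    each sub-processing area are visited in raster scan order.
    """
    subs_per_side = area_size // sub_size
    blocks_per_sub = sub_size // block
    subs = sorted(
        ((sx, sy) for sy in range(subs_per_side) for sx in range(subs_per_side)),
        key=lambda s: morton_key(s[0], s[1]),
    )
    order = []
    for sx, sy in subs:
        for by in range(blocks_per_sub):
            for bx in range(blocks_per_sub):
                order.append((sx * blocks_per_sub + bx, sy * blocks_per_sub + by))
    return order
```

For a 32×32 processing area, the first four entries are the first four pixel blocks of the top row and the fifth entry is the first pixel block of the second row, which corresponds to the sequence L11, A11, A12, A21, L12 of the example.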

Filter blocks for the pixel blocks to be filtered are loaded 706 into the work pixel buffer 410 according to this filtering order. Further, the number of pixel blocks to be loaded for a load stage of the pipelined filtering process depends on the location of the next pixel block to be filtered in the filtering order. For example, referring to FIG. 9 and FIGS. 11A-11E, as previously described, to filter the first pixel block L11 in the SAO processing area, L00, L01, L02, L10, L12, T00, A11, and A13 are loaded into the work pixel buffer 410 along with L11. This filter block is shown in FIG. 11A. The next pixel block to be filtered is A11. The filter block for A11 is shown in FIG. 11B. Note that six of the nine pixel blocks needed to form the filter block for A11, including the pixel block A11, will already be loaded in the work pixel buffer 410. Thus, the load stage of the filter block for A11 will only load three neighboring pixel blocks, T01, A12, and A14.

The next pixel block after A11 to be filtered is A12. The filter block for A12 is shown in FIG. 11C. Note that six of the nine pixel blocks needed to form the filter block for A12, including the pixel block A12, will already be loaded in the work pixel buffer 410. Thus, the load stage of the filter block for A12 will only load three neighboring pixel blocks, T02, A21, and A23. The next pixel block after A12 to be filtered is A21. The filter block for A21 is shown in FIG. 11D. Note that six of the nine pixel blocks needed to form the filter block for A21, including the pixel block A21, will already be loaded in the work pixel buffer 410. Thus, the load stage of the filter block for A21 will only load three neighboring pixel blocks, T03, A22, and A24.

The next pixel block after A21 to be filtered is L12. The filter block for L12 is shown in FIG. 11E. Given that the work pixel buffer 410 is sized to hold the current filter block and three additional pixel blocks, none of the pixel blocks needed to form the filter block for L12 will be in the work pixel buffer 410. Thus, the load stage of the filter block for L12 will load all nine pixel blocks of the filter block.
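The number of pixel blocks loaded in each load stage follows from how much the new filter block overlaps the previous one. The sketch below models this as a set difference over pixel block coordinates (column, row), under the simplifying assumption that the work pixel buffer retains exactly the previous filter block. The function names are hypothetical, and coordinates of -1 stand for blocks held in the left and top work buffers.

```python
def filter_block_coords(col, row):
    """Coordinates of the 3x3 pixel blocks centered on (col, row)."""
    return {(col + dc, row + dr) for dr in (-1, 0, 1) for dc in (-1, 0, 1)}


def blocks_to_load(previous, current):
    """Pixel blocks that must be loaded to form the filter block for current.

    previous is the (col, row) of the pixel block filtered just before, or
    None for the first pixel block of the SAO processing area.
    """
    needed = filter_block_coords(*current)
    if previous is None:
        return needed
    return needed - filter_block_coords(*previous)
```

Stepping one pixel block to the right yields three new blocks (the column to the right of the new center), while wrapping from the end of one row of a sub-processing area to the start of the next yields all nine.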

Referring again to FIG. 7, the sixteen pixels of the current pixel block are filtered 708 in parallel by the filter engine 408 and the filtered pixels are stored in the work pixel buffer 410 in the filter stage of the pipeline. The operation of the filter engine to filter a pixel block is described herein in reference to FIG. 14. As previously mentioned, the work pixel buffer 410 is sized to hold the filtered pixel block being generated by the filter engine and the previously filtered pixel block. The previously filtered pixel block is stored 710 in the store stage of the pipeline. Where this previously filtered pixel block is stored depends upon which color component of the SAO processing area is being filtered. If the luma color component is being filtered, the filtered pixel block bypasses the formatter 414 and is stored in the shared memory pool. If the Cb color component is being filtered, the formatter 414 stores the filtered pixel blocks in an internal memory. If the Cr color component is being filtered, the formatter 414 interleaves the filtered Cb pixel blocks and the filtered Cr pixel blocks and stores them in the shared memory pool.
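The routing performed in the store stage may be sketched as follows. This is an illustrative model only: the shared memory pool and the internal memory of the formatter are represented as Python lists, the function name `store_filtered_block` is hypothetical, and sample-by-sample interleaving of Cb and Cr is an assumption, as the granularity of the interleaving is not specified above.

```python
def store_filtered_block(component, block, shared_pool, cb_internal):
    """Route one filtered pixel block according to its color component."""
    if component == "luma":
        # Luma bypasses the formatter and goes straight to the shared pool.
        shared_pool.append(("luma", list(block)))
    elif component == "cb":
        # Cb is held in the formatter's internal memory until Cr arrives.
        cb_internal.append(list(block))
    elif component == "cr":
        # Cr is interleaved with the oldest held Cb block.
        # Assumption: interleaving is sample by sample (Cb, Cr, Cb, Cr, ...).
        cb_block = cb_internal.pop(0)
        interleaved = [v for pair in zip(cb_block, block) for v in pair]
        shared_pool.append(("cbcr", interleaved))
    else:
        raise ValueError(f"unknown color component: {component}")


pool, held_cb = [], []
store_filtered_block("cb", [1, 2], pool, held_cb)
store_filtered_block("cr", [9, 8], pool, held_cb)
print(pool)  # [('cbcr', [1, 9, 2, 8])]
```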

As previously mentioned, the rightmost column of pixel blocks and the bottom row of pixel blocks of the current LCU, e.g., LCU5 of FIG. 8, cannot be filtered due to unavailability of needed deblocked neighbors. The rightmost column of pixel blocks (except the bottom pixel block) will be filtered as part of the SAO filtering area of the subsequent LCU, e.g., LCU6 of FIG. 8. Thus, the pixel blocks of this rightmost column, e.g., B22, B24, B42, B44, D22, D24, D42, D44, and B07 of FIG. 9, need to be stored in the Left Work Buffer 1 prior to filtering the SAO processing area of LCU6, as does the last pixel block in the Top Work Buffer, e.g., T07 of FIG. 9. The pixel blocks that will form the Left Work Buffer 1 for the next LCU, e.g., LCU6 of FIG. 8, are copied into the appropriate locations in this buffer in the work memory 412 “on-the-fly” when certain pixel blocks of the SAO processing area of the current LCU, e.g., LCU5 of FIG. 8, are filtered.

The on-the-fly copying to the work buffer happens only when a pixel block in the work buffer is no longer needed for filtering. For example, when T07, B22, B24, and B42 are stored in the work pixel buffer 410 as part of one or more filter blocks, they can be copied to respective locations L10, L11, L12, and L13 in the Left Work Buffer 1 as the contents of these locations in the Left Work Buffer 1 are no longer needed for filtering of the SAO processing area. However, B44 cannot be copied to L14 the first time it is loaded into the work pixel buffer 410 as the current L14 is needed for filtering of subsequent pixel blocks. B44 may be copied to L14 the next time it is loaded into the work pixel buffer 410 for filtering of D21. Note that D22, D24, D42, D44, and B07 may be copied to respective locations in the Left Work Buffer 1 when initially loaded in the work pixel buffer 410.

To support the filtering of the rightmost column of pixel blocks, the pixel blocks in the left neighboring column of this rightmost column, e.g., B21, B23, B41, B43, D21, D23, D41, D43, and B06 of FIG. 9, need to be stored in the Left Work Buffer 0 prior to filtering the SAO processing area of the subsequent LCU, e.g., LCU6 of FIG. 8, as does the next-to-last pixel block in the Top Work Buffer, e.g., T06 of FIG. 9. The pixel blocks that will form the Left Work Buffer 0 for the next LCU, e.g., LCU6 of FIG. 8, are copied into the appropriate locations in this buffer in the work memory 412 “on-the-fly” when certain pixel blocks of the SAO processing area of the current LCU, e.g., LCU5 of FIG. 8, are filtered.

The on-the-fly copying to the work buffer happens only when a pixel block in the work buffer is no longer needed for filtering. For example, when T06, B21, B23, and B41 are stored in the work pixel buffer 410 as part of one or more filter blocks, they can be copied to respective locations L00, L01, L02, and L03 in the Left Work Buffer 0 as the contents of these locations in the Left Work Buffer 0 are no longer needed for filtering of the SAO processing area. However, B43 cannot be copied to L04 the first time it is loaded into the work pixel buffer 410 as the current L04 is needed for filtering of subsequent pixel blocks. B43 may be copied to L04 the next time it is loaded into the work pixel buffer 410 for filtering of D12. Note that D21, D23, D41, D43, and B06 may be copied to respective locations in the Left Work Buffer 0 when initially loaded in the work pixel buffer 410.

The bottom row of pixel blocks in the current LCU, e.g., LCU5 of FIG. 8, will be filtered as part of the SAO filtering area of the bottom neighboring LCU, e.g., LCU9 of FIG. 8. Thus, the deblocked pixel blocks of the next-to-last row of the current LCU, e.g., LCU5 of FIG. 8, are potentially needed to filter the bottom row of pixel blocks and will be the contents of the Top Work Buffer in the work memory 412 when the SAO filtering area of the bottom neighboring LCU, e.g., LCU9 of FIG. 8, is processed. The deblocked pixel blocks of the next-to-last row of the current LCU, e.g., LCU5, are copied into the appropriate locations in the Top Work Buffer in the work memory 412 “on-the-fly” when these pixel blocks are loaded into the work pixel buffer 410. Note that by the time this next-to-last row is processed, the pixel blocks of the current Top Work Buffer are no longer needed. Thus, for example, referring to FIG. 9, when the pixel block C33 is loaded into the work pixel buffer 410, it is also stored in the T00 location of the Top Work Buffer. In another example, when the pixel block D43 is loaded into the work pixel buffer 410, it is also stored in the T06 location of the Top Work Buffer. Note that although D44 will not be filtered, it is loaded into the work pixel buffer 410 when D43 is loaded, as it is potentially needed for filtering D43, and is also stored in the T07 location of the Top Work Buffer.

Referring again to FIG. 7, after all the pixel blocks in the current SAO processing area are filtered 712, the contents of the Top Work Buffer in the work memory 412 and the SAO parameters of the current LCU, e.g., LCU5 of FIG. 8, are stored in the shared memory pool for future filtering of the bottom row of pixel blocks in the current LCU.

FIG. 14 is a block diagram of the SAO filter engine 408 of FIG. 4. As previously mentioned, the filter engine 408 is configured to filter all sixteen pixels of a pixel block in parallel. The filter engine 408 includes an edge offset component for performing EO filtering of a pixel block, and a band offset component for performing BO filtering of a pixel block. The controller 400 knows the SAO filter type of each pixel block and activates either the edge offset component or the band offset component for each pixel block based on its SAO filter type. The multiplexor at the outputs of the two filtering components also selects the output of the appropriate component based on the SAO filter type of the pixel block being filtered.

One of the inputs to each filtering component is a set of 16 flags, one for each pixel to be filtered. The controller 400 uses these flags to manage filtering behavior for boundary conditions. If the flag corresponding to a pixel is set to 1, no filtering is performed on the pixel, even if filtering is otherwise enabled for the current SAO processing area. The controller 400 may use these flags, for example, to disable EO filtering of pixels at the boundaries of a frame as the pixel data needed for EO filtering of such pixels may not be available. The controller 400 may also use these flags, for example, to disable EO and/or BO filtering of certain pixels if the frame parameters indicate that SAO filtering across slice and/or tile boundaries is disabled.

To perform EO filtering, the controller 406 causes the nine pixel blocks of the current filter block in the work pixel buffer 410 to be stored in the filter block storage of the edge offset component. Further, the controller provides the EO type (from the SAO parameters) for the current pixel block and the 16 flags to the ALU (arithmetic logic unit) and loads the four offsets (from the SAO parameters) into four locations of the offset buffer. The fifth location of the offset buffer is set to zero. As will be explained below, the offset buffer is indexed by the output of the ALU to select the offset to be added to a pixel. The fifth location that is set to zero is selected by an index value of zero.

The ALU receives thirty-six pixels from the filter block storage: the sixteen pixels of the current pixel block to be filtered and the twenty pixels needed from the neighboring blocks in the filter block. The ALU computes an offset index for each of the sixteen pixels in parallel as per

offsetIdx = 2 + sign(p0 - p1) + sign(p0 - p2)
if (offsetIdx < 3) offsetIdx = (offsetIdx == 2) ? 0 : offsetIdx + 1

where p0 is a pixel to be filtered and p1 and p2 are neighboring pixels selected from the thirty-six input pixels according to the specified EO type. Further, the ALU forces the offset index to be zero for any pixel for which the corresponding flag in the sixteen flags is set to 1, indicating that the pixel is not to be filtered.

The sixteen offset indices computed by the ALU are input to a multiplexor that selects the offset values to be added to each pixel from the offset buffer based on values of the offset indices. The adder adds the sixteen offset values to the sixteen pixels of the current pixel block in parallel. The clip unit clips any pixel values that exceed the maximum pixel value, e.g., 255, and the resulting pixel block is stored in the work pixel buffer 410.
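The EO data path described above can be modeled in software as follows. This is a sketch under stated assumptions rather than a description of the hardware: the thirty-six input pixels are assumed to form a 6×6 window whose inner 4×4 region is the current pixel block, the neighbor directions for each EO type are assumed to follow the HEVC convention (0 horizontal, 1 vertical, 2 and 3 the two diagonals), the lower clamp at zero is an assumption (only the upper clip is described above), and the loop processes pixels sequentially whereas the filter engine processes all sixteen in parallel. The names `EO_NEIGHBORS` and `eo_filter_block` are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


# Assumed (row, column) deltas of the two neighbors p1 and p2 per EO type.
EO_NEIGHBORS = {
    0: ((0, -1), (0, 1)),    # horizontal
    1: ((-1, 0), (1, 0)),    # vertical
    2: ((-1, -1), (1, 1)),   # diagonal, top-left to bottom-right
    3: ((-1, 1), (1, -1)),   # diagonal, top-right to bottom-left
}


def eo_filter_block(window, eo_type, offsets, flags, max_val=255):
    """Filter the inner 4x4 block of a 6x6 window; return a 4x4 list of rows."""
    table = [0] + list(offsets)  # index 0 selects the zero entry
    (dr1, dc1), (dr2, dc2) = EO_NEIGHBORS[eo_type]
    out = []
    for r in range(1, 5):
        row = []
        for c in range(1, 5):
            p0 = window[r][c]
            p1 = window[r + dr1][c + dc1]
            p2 = window[r + dr2][c + dc2]
            idx = 2 + sign(p0 - p1) + sign(p0 - p2)
            if idx < 3:
                idx = 0 if idx == 2 else idx + 1
            if flags[(r - 1) * 4 + (c - 1)]:
                idx = 0  # flag set to 1: pixel is not filtered
            row.append(min(max(p0 + table[idx], 0), max_val))
        out.append(row)
    return out


window = [[100] * 6 for _ in range(6)]
window[1][1] = 90  # a local minimum at the top-left pixel of the block
result = eo_filter_block(window, 0, [5, 2, -2, -5], [0] * 16)
print(result[0])  # [95, 98, 100, 100]
```

In this example the local minimum maps to offset index 1 and is raised by 5, its right-hand neighbor maps to index 3 and is lowered by 2, and flat pixels map to index 0 and are left unchanged.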

To perform BO filtering, the controller 406 causes the sixteen pixels of the current pixel block in the work pixel buffer 410 to be stored in the pixel block register storage of the band offset component. Further, the controller provides the band offset position (from the SAO parameters) for the current pixel block and the 16 flags to the ALU (arithmetic logic unit) and loads the four offsets (from the SAO parameters) into four locations of the offset buffer. The fifth location of the offset buffer is set to zero. As will be explained below, the offset buffer is indexed by the output of the ALU to select the offset to be added to a pixel. The fifth location that is set to zero is selected by an index value of zero.

The ALU receives the sixteen pixels from the pixel block register storage and computes an offset index for each of the sixteen pixels in parallel as per

bandNum = (p0 & 0xF8) >> 3
offsetIdx = bandNum − StartbandNum + 1
if (offsetIdx < 1 or offsetIdx > 5) offsetIdx = 0

where p0 is a pixel to be filtered and StartbandNum is the band offset position (BOPos). Further, the ALU forces the offset index to be zero for any pixel for which the corresponding flag in the sixteen flags is set to 1, indicating that the pixel is not to be filtered.

The sixteen offset indices computed by the ALU are input to a multiplexor that selects the offset values to be added to each pixel from the offset buffer based on values of the offset indices. The adder adds the sixteen offset values to the sixteen pixels of the current pixel block in parallel. The clip unit clips any pixel values that exceed the maximum pixel value, e.g., 255, and the resulting pixel block is stored in the work pixel buffer 410.
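The BO data path can be modeled in the same way. The following sketch assumes 8-bit pixels and a flat list of sixteen pixels. Because the offset buffer described above has five locations (indices 0 through 4), the sketch maps any index above 4 to zero; this bound is an assumption that differs from the upper limit of 5 written in the expression above. The lower clamp at zero is likewise an assumption, and the function name `bo_filter_block` is hypothetical.

```python
def bo_filter_block(pixels, band_pos, offsets, flags, max_val=255):
    """Apply band offset filtering to a flat list of sixteen 8-bit pixels."""
    table = [0] + list(offsets)  # index 0 selects the zero entry
    out = []
    for p0, flag in zip(pixels, flags):
        band = (p0 & 0xF8) >> 3      # one of 32 bands, each 8 values wide
        idx = band - band_pos + 1
        # Assumption: only indices 1..4 address an offset; a set flag
        # forces the zero entry so the pixel is left unfiltered.
        if idx < 1 or idx > 4 or flag:
            idx = 0
        out.append(min(max(p0 + table[idx], 0), max_val))
    return out


pixels = [0, 8, 16, 24, 32, 40] + [200] * 10
print(bo_filter_block(pixels, 1, [1, 2, 3, 4], [0] * 16)[:6])
# [0, 9, 18, 27, 36, 40]
```

With a band offset position of 1, only pixels in bands 1 through 4 (values 8 through 39) receive an offset; pixels in every other band select the zero entry.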

Other Embodiments

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein.

For example, embodiments have been described herein assuming that the pixel blocks are 4×4. One of ordinary skill in the art will understand embodiments in which the size of the pixel blocks is different.

In another example, embodiments have been described herein assuming that the sub-processing areas of an SAO processing area are 16×16. One of ordinary skill in the art will understand embodiments in which the sub-processing areas are larger, e.g., 32×32.

In another example, embodiments have been described herein in which the filter engine includes separate components for EO and BO filtering. One of ordinary skill in the art will understand embodiments in which the design of the filtering engine is unified such that the offset buffer, multiplexor, adder, and clip unit are used for both EO and BO filtering and two ALUs are provided, selected by SAO type, one for EO and one for BO.
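One way to picture such a unified design is a single offset-lookup, add, and clip path fed by whichever index-computing ALU matches the SAO type. The sketch below is illustrative only. The names `eo_alu`, `bo_alu`, `ALUS`, and `unified_filter` are hypothetical, the EO ALU is assumed to receive the two neighbors of each pixel as precomputed pairs, the BO ALU is assumed to treat only indices 1 through 4 as addressing an offset, and the lower clamp at zero is an assumption.

```python
def sign(x):
    return (x > 0) - (x < 0)


def eo_alu(pixels, neighbor_pairs):
    """Offset indices for EO; neighbor_pairs holds (p1, p2) for each pixel."""
    indices = []
    for p0, (p1, p2) in zip(pixels, neighbor_pairs):
        idx = 2 + sign(p0 - p1) + sign(p0 - p2)
        if idx < 3:
            idx = 0 if idx == 2 else idx + 1
        indices.append(idx)
    return indices


def bo_alu(pixels, band_pos):
    """Offset indices for BO; indices outside 1..4 select the zero entry."""
    indices = []
    for p0 in pixels:
        idx = ((p0 & 0xF8) >> 3) - band_pos + 1
        indices.append(idx if 1 <= idx <= 4 else 0)
    return indices


ALUS = {"EO": eo_alu, "BO": bo_alu}  # selected by SAO type


def unified_filter(sao_type, pixels, alu_input, offsets, flags=None, max_val=255):
    """Shared offset buffer, multiplexor, adder, and clip for both SAO types."""
    flags = flags or [0] * len(pixels)
    table = [0] + list(offsets)
    indices = ALUS[sao_type](pixels, alu_input)
    return [
        min(max(p + (0 if f else table[i]), 0), max_val)
        for p, i, f in zip(pixels, indices, flags)
    ]


print(unified_filter("BO", [8, 40], 1, [1, 2, 3, 4]))                        # [9, 40]
print(unified_filter("EO", [90, 100], [(100, 100), (90, 100)], [5, 2, -2, -5]))  # [95, 98]
```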

In another example, embodiments have been described herein in which the LCUs in a frame are filtered in raster scan order. One of ordinary skill in the art will understand embodiments in which tiling is enabled and the LCUs are processed tile by tile. In such embodiments, left work buffers may be stored in the shared memory pool as well as the top work buffers and retrieved as needed.

In another example, one of ordinary skill in the art will understand embodiments in which the filter engine may be replicated to allow parallel SAO filtering of luma, Cb, and Cr pixel blocks.

In another example, one of ordinary skill in the art will understand embodiments in which some or all of the work memory is outside of the SAO architecture, e.g., in an on-chip memory or an external memory.

In another example, one of ordinary skill in the art will understand embodiments in which the SAO architecture has a single unified buffer rather than a separate work pixel buffer and a separate work memory.

In another example, one of ordinary skill in the art will understand embodiments in which the scan order of the pixel blocks in an SAO processing area is different than that described above. For example, the pixel blocks may be scanned row-by-row in raster scan order or column-by-column in which each column is scanned top-to-bottom.
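The scan orders mentioned here can be compared with a short sketch. It assumes a square SAO processing area measured in pixel blocks, sub-processing areas whose side is also measured in pixel blocks (four 4×4 pixel blocks per side for a 16×16 sub-processing area), and a power-of-two number of sub-processing areas per side. The function names are hypothetical, and the Z-scan is implemented as a Morton order over the grid of sub-processing areas.

```python
def z_order(n):
    """(row, col) cells of an n x n grid in Z-scan (Morton) order."""
    def morton(cell):
        r, c = cell
        key = 0
        for bit in range(n.bit_length()):
            key |= ((c >> bit) & 1) << (2 * bit)      # column bits: even positions
            key |= ((r >> bit) & 1) << (2 * bit + 1)  # row bits: odd positions
        return key
    return sorted(((r, c) for r in range(n) for c in range(n)), key=morton)


def zscan_raster_order(area, sub):
    """Sub-processing areas in Z-scan order, pixel blocks within each in raster order."""
    order = []
    for sr, sc in z_order(area // sub):
        for r in range(sub):
            for c in range(sub):
                order.append((sr * sub + r, sc * sub + c))
    return order


def raster_order(area):
    """Pixel blocks row by row, left to right."""
    return [(r, c) for r in range(area) for c in range(area)]


def column_order(area):
    """Pixel blocks column by column, each column top to bottom."""
    return [(r, c) for c in range(area) for r in range(area)]


order = zscan_raster_order(area=8, sub=4)
print(order[:5])    # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0)]
print(order[16])    # (0, 4)
```

For an 8×8 grid of pixel blocks with 4×4 sub-processing areas, the first sixteen entries cover the top-left sub-processing area row by row before the scan moves to the top-right one.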

It is therefore contemplated that the appended claims will cover any such modifications of the embodiments as fall within the true scope of the invention.

What is claimed is:
 1. A method for sample adaptive offset (SAO) filtering of largest coding units (LCUs) of a video frame in an SAO component, the method comprising: receiving, by the SAO component, an indication that deblocked pixel blocks of an LCU are available; and applying SAO filtering, by the SAO component, to each pixel block of pixel blocks of an SAO processing area corresponding to the LCU responsive to the indication, wherein pixels of each pixel block of the SAO processing area are filtered in parallel.
 2. The method of claim 1, wherein a pixel block is a 4×4 block of pixels.
 3. The method of claim 10, wherein a pixel block is one selected from a group consisting of a luminance pixel block, a Cr pixel block, and a Cb pixel block.
 4. The method of claim 1, wherein applying SAO filtering comprises: filtering each pixel block of the SAO processing area according to a scan order in which the SAO processing area is divided into non-overlapping sub-processing areas that are scanned in Z-scan order and pixel blocks within a sub-processing area block are scanned in raster scan order.
 5. The method of claim 4, wherein a sub-processing area is a 16×16 block of pixels.
 6. The method of claim 1, wherein applying SAO filtering comprises filtering the pixel blocks in the SAO processing area in a scan order selected from a group consisting of raster scan order and column by column scan order in which each column is scanned top to bottom.
 7. (canceled)
 8. The method of claim 7, wherein the pixel blocks of the LCU in the SAO processing area are stored in a memory comprised in the SAO component, the pixel blocks of a left neighboring LCU are stored in a first work buffer comprised in the SAO component, pixel blocks of a left neighboring column of a rightmost column of pixel blocks of the left neighboring LCU are stored in a second work buffer comprised in the SAO component, and pixel blocks of a top neighboring row of a bottom row of pixel blocks of a top neighboring LCU are stored in a third work buffer comprised in the SAO component.
 9. The method of claim 8, wherein applying SAO filtering comprises: storing pixel blocks of a rightmost column of pixel blocks of the LCU in the first work buffer; storing pixel blocks of a left neighboring column of pixel blocks of the rightmost column in the second work buffer; and storing pixel blocks of a top neighboring row of pixel blocks of the bottom row of pixel blocks of the LCU in the third work buffer.
 10. An apparatus for sample adaptive offset (SAO) filtering, the apparatus comprising: a memory; a controller coupled to the memory and configured to sequence loading of pixel blocks of an SAO processing area into the memory, filtering of the pixel blocks by a filter engine, and storing of the filtered pixel blocks, wherein the SAO processing area corresponds to a largest coding unit (LCU) of a video frame, and wherein the loading, filtering, and storing is performed responsive to an indication that deblocked pixel blocks of the LCU are available; and the filter engine coupled to the controller and the memory, wherein the filter engine is configured to apply SAO filtering to a pixel block of the SAO processing area stored in the memory, wherein all pixels in the pixel block are filtered in parallel.
 11. The apparatus of claim 10, wherein a pixel block is a 4×4 block of pixels.
 12. The apparatus of claim 10, wherein a pixel block is one selected from a group consisting of a luminance pixel block, a Cr pixel block, and a Cb pixel block.
 13. The apparatus of claim 10, wherein the controller is configured to load pixel blocks of the SAO processing area into the memory for filtering by the filter engine according to a scan order in which the SAO processing area is divided into non-overlapping sub-processing areas that are scanned in Z-scan order and pixel blocks within a sub-processing area block are scanned in raster scan order.
 14. The apparatus of claim 13, wherein a sub-processing area is a 16×16 block of pixels.
 15. The apparatus of claim 10, wherein the controller is configured to load pixel blocks of the SAO processing area into the memory for filtering by the filter engine in a scan order selected from a group consisting of raster scan order and column by column scan order in which each column is scanned top to bottom.
 16. (canceled)
 17. The apparatus of claim 10, wherein the memory comprises a first work buffer for storing pixel blocks of a left neighboring LCU, a second work buffer for storing pixel blocks of a left neighboring column of a rightmost column of pixel blocks of the left neighboring LCU, and a third work buffer for storing pixel blocks of a top neighboring row of a bottom row of pixel blocks of a top neighboring LCU.
 18. The apparatus of claim 17, wherein the controller is configured to cause first pixel blocks of a rightmost column of pixel blocks of the LCU to be stored in the first work buffer, second pixel blocks of a left neighboring column of pixel blocks of the rightmost column to be stored in the second work buffer, and third pixel blocks of a top neighboring row of pixel blocks of the bottom row of pixel blocks of the LCU to be stored in the third work buffer.
 19. An apparatus for sample adaptive offset (SAO) filtering of largest coding units (LCUs) of a video frame in an SAO component, the method comprising: circuitry for receiving, by the SAO component, an indication that deblocked pixel blocks of an LCU are available; and circuitry for applying SAO filtering, by the SAO component, to each pixel block of pixel blocks of an SAO processing area corresponding to the LCU responsive to the indication, wherein pixels of each pixel block of the SAO processing area are filtered in parallel.
 20. The apparatus of claim 19, wherein a pixel block is one selected from a group consisting of a luminance pixel block, a Cr pixel block, and a Cb pixel block.
 21. The apparatus of claim 19, wherein applying SAO filtering comprises: filtering each pixel block of the SAO processing area according to a scan order in which the SAO processing area is divided into non-overlapping sub-processing areas that are scanned in Z-scan order and pixel blocks within a sub-processing area block are scanned in raster scan order.
 22. The apparatus of claim 19, wherein applying SAO filtering comprises filtering the pixel blocks in the SAO processing area in a scan order selected from a group consisting of raster scan order and column by column scan order in which each column is scanned top to bottom.
 23. The apparatus of claim 22, wherein the pixel blocks of the LCU in the SAO processing area are stored in a memory comprised in the SAO component, the pixel blocks of a left neighboring LCU are stored in a first work buffer comprised in the SAO component, pixel blocks of a left neighboring column of a rightmost column of pixel blocks of the left neighboring LCU are stored in a second work buffer comprised in the SAO component, and pixel blocks of a top neighboring row of a bottom row of pixel blocks of a top neighboring LCU are stored in a third work buffer comprised in the SAO component.
 24. The apparatus of claim 23, wherein applying SAO filtering comprises: storing pixel blocks of a rightmost column of pixel blocks of the LCU in the first work buffer; storing pixel blocks of a left neighboring column of pixel blocks of the rightmost column in the second work buffer; and storing pixel blocks of a top neighboring row of pixel blocks of the bottom row of pixel blocks of the LCU in the third work buffer.